Name: Divyanshu Singh, Asang Kumar & Jaspreet Singh
Group No: 86
Dataset: Water Potability Prediction Dataset

Water Potability Prediction Assignment 1 – PS-2

Problem Statement¶

Access to safe drinking-water is essential to health, a basic human right, and a component of effective policy for health protection. It is an important health and development issue at national, regional and local levels. The objective is to predict the potability of water from the given features. The water_potability.csv file contains water quality metrics for 3276 different water bodies.

1. Importing Libraries/Dataset¶

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, roc_curve
from sklearn.model_selection import cross_validate, cross_val_score
In [2]:
df = pd.read_csv('../water_potability.csv')
In [3]:
df
Out[3]:
ph Hardness Solids Chloramines Sulfate Conductivity Organic_carbon Trihalomethanes Turbidity Potability
0 NaN 204.890455 20791.318981 7.300212 368.516441 564.308654 10.379783 86.990970 2.963135 0
1 3.716080 129.422921 18630.057858 6.635246 NaN 592.885359 15.180013 56.329076 4.500656 0
2 8.099124 224.236259 19909.541732 9.275884 NaN 418.606213 16.868637 66.420093 3.055934 0
3 8.316766 214.373394 22018.417441 8.059332 356.886136 363.266516 18.436524 100.341674 4.628771 0
4 9.092223 181.101509 17978.986339 6.546600 310.135738 398.410813 11.558279 31.997993 4.075075 0
... ... ... ... ... ... ... ... ... ... ...
3271 4.668102 193.681735 47580.991603 7.166639 359.948574 526.424171 13.894419 66.687695 4.435821 1
3272 7.808856 193.553212 17329.802160 8.061362 NaN 392.449580 19.903225 NaN 2.798243 1
3273 9.419510 175.762646 33155.578218 7.350233 NaN 432.044783 11.039070 69.845400 3.298875 1
3274 5.126763 230.603758 11983.869376 6.303357 NaN 402.883113 11.168946 77.488213 4.708658 1
3275 7.874671 195.102299 17404.177061 7.509306 NaN 327.459760 16.140368 78.698446 2.309149 1

3276 rows × 10 columns

2. Data Visualization and Exploration¶

2.1 Sanity Check¶

In [4]:
# Print the first two rows
print(df.head(2))
        ph    Hardness        Solids  Chloramines     Sulfate  Conductivity  \
0      NaN  204.890455  20791.318981     7.300212  368.516441    564.308654   
1  3.71608  129.422921  18630.057858     6.635246         NaN    592.885359   

   Organic_carbon  Trihalomethanes  Turbidity  Potability  
0       10.379783        86.990970   2.963135           0  
1       15.180013        56.329076   4.500656           0  

2.2 Checking the class imbalance¶

In [5]:
labels = ['Non-Potable', 'Potable']
data = [df['Potability'].value_counts()[0],
         df['Potability'].value_counts()[1]
        ]
fig1, ax1 = plt.subplots(figsize=[15,6])
color = ['red', 'blue']
ax1.pie(data, labels=labels,explode=[0.05]*2, autopct='%1.1f%%',pctdistance=0.5, shadow=True, colors = color)
plt.title("Water Potability", fontsize=20);
plt.show()
In [6]:
df.Potability.value_counts()
Out[6]:
0    1998
1    1278
Name: Potability, dtype: int64

Comment on class imbalance with appropriate visualization method.
Based on the class distribution above (1998 (61%) non-potable and 1278 (39%) potable), the target variable is imbalanced.

In this case, the majority class is the "non-potable" class, which constitutes 61% of the dataset, while the minority class is the "potable" class, making up only 39% of the dataset.

Class imbalances can impact the performance of machine learning models, particularly when the minority class is underrepresented. Models trained on imbalanced data tend to be biased towards the majority class, leading to lower accuracy and predictive performance on the minority class.

Logically, we would prefer a model that produces more false negatives (potable water flagged as non-potable) than false positives (non-potable water flagged as potable), since drinking unsafe water is the costlier error.
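One common way to handle this imbalance (not used in the notebook itself, but a possible extension) is to reweight the classes inversely to their frequency. The sketch below uses synthetic stand-in data with the same 1998/1278 split to show what `class_weight='balanced'` computes:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Synthetic stand-in mirroring the dataset's 61/39 imbalance (1998 vs 1278)
rng = np.random.default_rng(42)
y = np.array([0] * 1998 + [1] * 1278)
X = rng.normal(size=(len(y), 3)) + y[:, None] * 0.5

# 'balanced' assigns each class the weight n_samples / (n_classes * n_class),
# so the minority (potable) class gets a weight above 1 and is not ignored
weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], np.round(weights, 3))))

# Passing class_weight='balanced' applies these weights during fitting
lr = LogisticRegression(class_weight='balanced').fit(X, y)
```

With these weights, each class contributes equally to the loss regardless of its size.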

2.3 Visualizations to get an insight about the dataset¶

In [7]:
plt.style.use('seaborn-dark')
colors=['#00a8e8', '#00afb9',  '#48bfe3', '#006e90', '#20a4f3', '#00b4d8', '#0466c8', '#20a4f3', '#00008B','#1E90FF']

# One distribution plot per feature; displot is figure-level and creates
# its own figure, so no extra plt.figure() call is needed
for i, col in enumerate(df.columns):
    sns.displot(df[col], color=colors[i], kde=True)
    plt.title(f'Distribution plot for {col}')
    plt.xlabel(f'Skewness = {round(df[col].skew(), 3)}', fontsize=14)
    plt.show()

Relationship Between Each Variable and Target Variable (Potability)

In [8]:
# Iterate over each feature column
for col in df.columns[:-1]:
    plt.figure(figsize=(8, 6))
    sns.boxplot(data=df, x="Potability", y=col)
    plt.title(f"Box Plot of {col} with Potability", size=10)
    plt.show()
In [9]:
sns.pairplot(df,hue='Potability')
Out[9]:
<seaborn.axisgrid.PairGrid at 0x223e3435eb0>

Checking for the missing/NAN values.

In [10]:
plt.title('Missing Values Per Feature')
nans = df.isna().sum().sort_values(ascending=False).to_frame()
sns.heatmap(nans,annot=True,fmt='d',cmap='vlag')
Out[10]:
<AxesSubplot:title={'center':'Missing Values Per Feature'}>

2.4 Correlational analysis on the dataset¶

In [11]:
plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(), annot=True, cmap="Reds")
plt.title("Correlations Between Variables", size=16)
plt.show()
In [12]:
corr_matrix = df.corr()
corr_matrix["Potability"].sort_values(ascending=False)
Out[12]:
Potability         1.000000
Solids             0.033743
Chloramines        0.023779
Trihalomethanes    0.007130
Turbidity          0.001581
ph                -0.003556
Conductivity      -0.008128
Hardness          -0.013837
Sulfate           -0.023577
Organic_carbon    -0.030001
Name: Potability, dtype: float64

Observation¶

The correlation coefficient ranges from –1 to 1. When it is close to 1, it means that there is a strong positive correlation. When it is close to -1, it means that there is a negative correlation.
It appears that there is no linear/rank correlation between the output label and any of the features. So all the features are important and we cannot drop any of them. Our initial thought is therefore that this data cannot be modeled by a simple classifier, as it is highly entangled and non-linear in nature. In other words, correlation analysis will have no effect on feature selection here.

3. Data Pre-processing and cleaning¶

3.1 Identifying NULL or Missing Values¶

In [13]:
plt.title('Missing Values Per Feature')
nans = df.isna().sum().sort_values(ascending=False).to_frame()
sns.heatmap(nans,annot=True,fmt='d',cmap='vlag')
Out[13]:
<AxesSubplot:title={'center':'Missing Values Per Feature'}>

Observation:
The overall dataset is small, so removing rows with null or missing values would shrink it further, which we don't want. Instead we fill the NaN/missing values; techniques like mean, median, or mode imputation can be applied with "fillna()" in pandas.

We are going with the mean because of the reasons below:
Mean imputation is often used when the missing values are numerical and the distribution of the variable is approximately normal.
Median imputation is preferred when the distribution is skewed, as the median is less sensitive to outliers than the mean.
Mode imputation is suitable for categorical variables or numerical variables with a small number of unique values.
Because our features are approximately normally distributed, mean imputation is the better option.

In [14]:
df.fillna(df.mean(), inplace=True)
df.head()
Out[14]:
ph Hardness Solids Chloramines Sulfate Conductivity Organic_carbon Trihalomethanes Turbidity Potability
0 7.080795 204.890455 20791.318981 7.300212 368.516441 564.308654 10.379783 86.990970 2.963135 0
1 3.716080 129.422921 18630.057858 6.635246 333.775777 592.885359 15.180013 56.329076 4.500656 0
2 8.099124 224.236259 19909.541732 9.275884 333.775777 418.606213 16.868637 66.420093 3.055934 0
3 8.316766 214.373394 22018.417441 8.059332 356.886136 363.266516 18.436524 100.341674 4.628771 0
4 9.092223 181.101509 17978.986339 6.546600 310.135738 398.410813 11.558279 31.997993 4.075075 0
In [15]:
df.isnull().sum()
Out[15]:
ph                 0
Hardness           0
Solids             0
Chloramines        0
Sulfate            0
Conductivity       0
Organic_carbon     0
Trihalomethanes    0
Turbidity          0
Potability         0
dtype: int64

Dealing with outliers

What is an outlier? A value that lies far from the rest of the distribution. A box plot gives us a better view of the outliers.

Outlier Using The Box Plot

In [16]:
df.boxplot(figsize=(14,7))
Out[16]:
<AxesSubplot:>

Outliers are present, most prominently in "Solids".

In [17]:
df.describe()
Out[17]:
ph Hardness Solids Chloramines Sulfate Conductivity Organic_carbon Trihalomethanes Turbidity Potability
count 3276.000000 3276.000000 3276.000000 3276.000000 3276.000000 3276.000000 3276.000000 3276.000000 3276.000000 3276.000000
mean 7.080795 196.369496 22014.092526 7.122277 333.775777 426.205111 14.284970 66.396293 3.966786 0.390110
std 1.469956 32.879761 8768.570828 1.583085 36.142612 80.824064 3.308162 15.769881 0.780382 0.487849
min 0.000000 47.432000 320.942611 0.352000 129.000000 181.483754 2.200000 0.738000 1.450000 0.000000
25% 6.277673 176.850538 15666.690297 6.127421 317.094638 365.734414 12.065801 56.647656 3.439711 0.000000
50% 7.080795 196.967627 20927.833607 7.130299 333.775777 421.884968 14.218338 66.396293 3.955028 0.000000
75% 7.870050 216.667456 27332.762127 8.114887 350.385756 481.792304 16.557652 76.666609 4.500320 1.000000
max 14.000000 323.124000 61227.196008 13.127000 481.030642 753.342620 28.300000 124.000000 6.739000 1.000000
In [18]:
df['Solids'].describe()
Out[18]:
count     3276.000000
mean     22014.092526
std       8768.570828
min        320.942611
25%      15666.690297
50%      20927.833607
75%      27332.762127
max      61227.196008
Name: Solids, dtype: float64

The maximum value in "Solids" is far above the 75th percentile. For now we leave the data as it is: Solids is an important attribute for deciding water potability, which is why it is kept unchanged.
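The box plot flags points beyond 1.5×IQR from the quartiles. A minimal sketch of that rule on a toy, hypothetical 'Solids'-like series (the actual values are illustrative, not from the dataset):

```python
import pandas as pd

# IQR rule: points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR count as outliers
s = pd.Series([15000, 18000, 21000, 24000, 27000, 61000])
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(len(outliers))  # the 61000 point falls above the upper fence
```

The same rule, applied per column of df, would give a count of box-plot outliers per feature.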

Skewness Correction

Skewness of a distribution is defined as a lack of symmetry. In a symmetrical distribution the mean, median and mode are equal. The normal distribution has a skewness of 0.
Skewness is of two types
Positive skewness: When the tail on the right side of the distribution is longer or fatter, we say the data is positively skewed. For a positive skewness mean > median > mode.
Negative skewness: When the tail on the left side of the distribution is longer or fatter, we say that the distribution is negatively skewed. For a negative skewness mean < median < mode.
The thumb rule is:
If the skewness is between -0.5 to +0.5, then data is fairly symmetrical.
If the skewness is between -1 to -0.5 or 0.5 to 1 then data is moderately skewed.
If the skewness is less than -1 or greater than +1, then our data is heavily skewed.
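The thumb rule above can be captured in a small helper (not part of the original notebook, added here for illustration):

```python
def skew_category(s):
    """Classify a skewness value by the rule of thumb above."""
    a = abs(s)
    if a <= 0.5:
        return 'fairly symmetrical'
    if a <= 1.0:
        return 'moderately skewed'
    return 'heavily skewed'

# e.g. the Solids skewness of ~0.62 is moderately skewed
print(skew_category(0.622), skew_category(0.028), skew_category(-1.3))
```

Applied to `df.skew()`, this would label each feature directly.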

In [19]:
# Calculate skewness for each feature
skewness = df.skew()

# Print the skewness values
for feature, skew in skewness.items():
    print(f"Skewness of {feature}: {skew}")

# Determine the number of rows and columns for subplots
n_features = len(df.columns)
n_rows = int(n_features / 3) + (n_features % 3 > 0)
n_cols = 3

# Visualize the transformed distributions with normal distribution curve
plt.figure(figsize=(15, 10))
for i, feature in enumerate(df.columns):
    plt.subplot(n_rows, n_cols, i + 1)
    sns.histplot(df[feature], kde=True)
    plt.xlabel(feature)
    plt.ylabel("Frequency")

plt.tight_layout()
plt.show()
Skewness of ph: 0.027795881574044657
Skewness of Hardness: -0.0393417047767249
Skewness of Solids: 0.6216344855169127
Skewness of Chloramines: -0.01209844012438221
Skewness of Sulfate: -0.04118437258768767
Skewness of Conductivity: 0.2644902239264119
Skewness of Organic_carbon: 0.025532582090242492
Skewness of Trihalomethanes: -0.08516101983758483
Skewness of Turbidity: -0.007816642377392604
Skewness of Potability: 0.4507837491078628

Observation¶

Applying the thumb rule to the values above, the data is fairly symmetrical (only Solids, at 0.62, is moderately skewed). That is the reason we are not going for skew correction.

3.2. Feature Transformation¶

In the context of the given problem, feature engineering techniques such as standardization and normalization can be applied to preprocess the data and improve the performance of machine learning models.

1. Standardization (Z-score normalization):
Justification: Standardization is a popular technique that transforms the data to have zero mean and unit variance. It brings all features to a similar scale, which can be beneficial for models that are sensitive to the scale of the features, such as linear regression, logistic regression, and support vector machines.

Usage: Standardization is typically applied when the features have different scales and units, and when the distribution of the data is approximately Gaussian or at least not heavily skewed. We are going with this because:

  1. The distribution of the data is approximately Gaussian or not heavily skewed.
  2. The features have different scales and units.
  3. We want to bring the features to a similar scale.
In [20]:
# Select the features to be standardized
features_to_standardize = ['ph', 'Hardness', 'Solids', 'Chloramines', 'Sulfate', 'Conductivity', 'Organic_carbon', 'Trihalomethanes', 'Turbidity']

# Perform standardization
scaler = StandardScaler()
data_standardized = scaler.fit_transform(df[features_to_standardize])
data_standardized = pd.DataFrame(data_standardized, columns=features_to_standardize)

# Visualize the standardized distributions
plt.figure(figsize=(12, 6))
for i, feature in enumerate(data_standardized.columns):
    plt.subplot(3, 3, i+1)
    sns.histplot(data_standardized[feature], kde=True)
    plt.xlabel(feature + " (Standardized)")
    plt.ylabel("Frequency")

plt.tight_layout()
plt.show()
In [21]:
df.head()
Out[21]:
ph Hardness Solids Chloramines Sulfate Conductivity Organic_carbon Trihalomethanes Turbidity Potability
0 7.080795 204.890455 20791.318981 7.300212 368.516441 564.308654 10.379783 86.990970 2.963135 0
1 3.716080 129.422921 18630.057858 6.635246 333.775777 592.885359 15.180013 56.329076 4.500656 0
2 8.099124 224.236259 19909.541732 9.275884 333.775777 418.606213 16.868637 66.420093 3.055934 0
3 8.316766 214.373394 22018.417441 8.059332 356.886136 363.266516 18.436524 100.341674 4.628771 0
4 9.092223 181.101509 17978.986339 6.546600 310.135738 398.410813 11.558279 31.997993 4.075075 0
In [22]:
df.describe()
Out[22]:
ph Hardness Solids Chloramines Sulfate Conductivity Organic_carbon Trihalomethanes Turbidity Potability
count 3276.000000 3276.000000 3276.000000 3276.000000 3276.000000 3276.000000 3276.000000 3276.000000 3276.000000 3276.000000
mean 7.080795 196.369496 22014.092526 7.122277 333.775777 426.205111 14.284970 66.396293 3.966786 0.390110
std 1.469956 32.879761 8768.570828 1.583085 36.142612 80.824064 3.308162 15.769881 0.780382 0.487849
min 0.000000 47.432000 320.942611 0.352000 129.000000 181.483754 2.200000 0.738000 1.450000 0.000000
25% 6.277673 176.850538 15666.690297 6.127421 317.094638 365.734414 12.065801 56.647656 3.439711 0.000000
50% 7.080795 196.967627 20927.833607 7.130299 333.775777 421.884968 14.218338 66.396293 3.955028 0.000000
75% 7.870050 216.667456 27332.762127 8.114887 350.385756 481.792304 16.557652 76.666609 4.500320 1.000000
max 14.000000 323.124000 61227.196008 13.127000 481.030642 753.342620 28.300000 124.000000 6.739000 1.000000

Observation¶

Note that the cells above show df unchanged: the scaler wrote its output to the separate data_standardized frame rather than back into df. Standardization transforms the data based on two statistical parameters, the mean and the standard deviation:
1. Mean:
Before standardization: the mean of the original data represents the average value of the feature; it can be positive, negative, or zero.
After standardization: the mean of the standardized data becomes zero, so the distribution is centered around zero.
2. Standard Deviation:
Before standardization: the standard deviation of the original data measures the spread or variability of the feature values.
After standardization: the standard deviation of the standardized data becomes 1. This makes the spread of the data points similar and facilitates comparison across different features.
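These two properties are easy to verify numerically. A minimal sketch on a tiny stand-in frame (the notebook applies the same scaler to df[features_to_standardize]):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy two-column frame; values are illustrative, not from the dataset
toy = pd.DataFrame({'ph': [6.2, 7.1, 8.3, 5.9],
                    'Solids': [15000., 21000., 27000., 61000.]})
scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

# Each column now has mean ~0 and (population, ddof=0) std ~1
print(scaled.mean().round(6).tolist())
print(scaled.std(ddof=0).round(6).tolist())
```

Note that StandardScaler uses the population standard deviation (ddof=0), so `scaled.std()` with the pandas default ddof=1 would be slightly above 1.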

4. Model Building¶

4.1 Splitting the data¶

Splitting data based on:
Dataset Size: The size of the dataset plays a role in determining the split ratio. We have enough data points to decide that a 70:30 split is a reasonable approach. We have also tested our model with an 80:20 split and 5-fold cross-validation and found that the results are quite similar, so we are going with the standard 70:30.
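Given the class imbalance noted earlier, a stratified split (not used in the cell below, but worth considering) keeps the 61/39 ratio identical in both folds. A sketch on synthetic labels with the dataset's class counts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the dataset's 1998/1278 split; X is a dummy feature column
y = np.array([0] * 1998 + [1] * 1278)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y preserves the class ratio in both train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))  # both ~0.39
```

Without `stratify`, the ratio can drift between folds by chance, which matters more for small or heavily imbalanced datasets.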

70:30 Train-Test Split Scores & Logistic Regression

In [23]:
# Separate the features and target variable
X = df.drop('Potability', axis=1)
y = df['Potability']

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Print the sizes of the training and test sets
print("Training set size:", X_train.shape[0])
print("Test set size:", X_test.shape[0])
Training set size: 2293
Test set size: 983

4.2 Model building with L1&L2 Values¶

In [24]:
# Define the list of C values to be evaluated
C_values = [1, 0.5, 0.1, 0.01, 0.003]

# Perform model development and evaluation
best_score = 0
best_C = None
best_penalty = None

for penalty in ['l1', 'l2']:
    for C in C_values:
        # Build the Logistic Regression model
        lr = LogisticRegression(penalty=penalty, C=C, solver='liblinear')
        lr.fit(X_train, y_train)

        # Evaluate the model
        score = lr.score(X_test, y_test)

        # Check if the current model is the best so far
        if score > best_score:
            best_score = score
            best_C = C
            best_penalty = penalty

# Print the best parameter and its corresponding accuracy score
print("Best Penalty:", best_penalty)
print("Best C:", best_C)
print("Accuracy:", best_score)
Best Penalty: l1
Best C: 1
Accuracy: 0.6276703967446592

Justification¶

The choice of the best parameter is based on the highest accuracy score achieved on the test set. In this case, the model with L1 penalty and C=1 yielded the highest accuracy among all the combinations evaluated.

L1 penalty (Lasso regularization) encourages sparse solutions by shrinking less important features towards zero, effectively performing feature selection. It can be useful when there are many features and some of them are irrelevant or have little impact on the target variable.

The chosen value of C, which represents the inverse of regularization strength, was 1. A larger value of C indicates weaker regularization, allowing the model to focus more on fitting the training data.
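The feature-selection effect of the L1 penalty can be seen directly in the coefficients. A sketch on synthetic data (one informative feature, two noise features; not the water dataset):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Only feature 0 carries signal; features 1 and 2 are pure noise
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.3 * rng.normal(size=500) > 0).astype(int)

# Weak penalty (C=1) vs strong penalty (C=0.01): the strong L1 penalty
# drives the uninformative coefficients to exactly zero
lr_weak = LogisticRegression(penalty='l1', C=1, solver='liblinear').fit(X, y)
lr_strong = LogisticRegression(penalty='l1', C=0.01, solver='liblinear').fit(X, y)
print(np.round(lr_weak.coef_[0], 3))
print(np.round(lr_strong.coef_[0], 3))
```

Inspecting `lr.coef_` on the fitted water-potability model in the same way would show which features, if any, the chosen L1/C=1 combination zeroed out.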

5. Performance Evaluation¶

5.1 Prediction for the test data & evaluation metrics¶

In [25]:
# Separate the features and target variable
X = df.drop('Potability', axis=1)
y = df['Potability']

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Build the Logistic Regression model with the best parameters
lr = LogisticRegression(penalty='l1', C=1, solver='liblinear')
lr.fit(X_train, y_train)

# Perform predictions on the test data
y_pred = lr.predict(X_test)

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred)

# Print the evaluation metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)
print("ROC AUC:", roc_auc)

# Create a confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Plot the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Reds')
plt.title("Confusion Matrix")
plt.xlabel("Predicted Class")
plt.ylabel("True Class")
plt.show()

# Calculate the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred)

# Plot the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()
Accuracy: 0.6276703967446592
Precision: 0.0
Recall: 0.0
F1-Score: 0.0
ROC AUC: 0.5
C:\Users\jaspr\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\sklearn\metrics\_classification.py:1308: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))

Observation¶

In this case, the model did not predict any instances as positive (potable water) in the test set, resulting in zero true positives and zero false positives. This leaves precision, recall, and F1-score at 0 for the positive class (and makes precision ill-defined, hence the warning above).

Accuracy: The accuracy score from the 70:30 train-test split is approximately 0.628, indicating that the model correctly predicts the class label for around 63% of the instances in the test set.

Precision: The precision score from the 70:30 split is 0, indicating that the model did not predict any positive instances correctly in the test set.

Recall: The recall score from the 70:30 split is 0, suggesting that the model failed to identify any positive instances in the test set.

F1-Score: The F1-score from the 70:30 split is 0, reflecting the lack of both precision and recall for the positive class.
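One caveat on the ROC analysis above: `roc_curve` was fed the hard 0/1 predictions, which yields only a single operating point (hence the degenerate curve and AUC of 0.5). Feeding it the predicted probabilities traces the full curve. A sketch on synthetic data, not the water dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic binary problem with one informative feature
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 2))
y = (X[:, 0] + rng.normal(scale=0.8, size=400) > 0).astype(int)

lr = LogisticRegression().fit(X, y)

# Hard labels collapse the ROC curve to one point; probability scores
# rank the samples and recover the full curve
scores = lr.predict_proba(X)[:, 1]
auc_labels = roc_auc_score(y, lr.predict(X))
auc_scores = roc_auc_score(y, scores)
print(round(auc_labels, 3), round(auc_scores, 3))
```

Replacing `y_pred` with `lr.predict_proba(X_test)[:, 1]` in the `roc_curve`/`roc_auc_score` calls above would give a more informative picture of the classifier's ranking ability, even when all thresholded predictions fall in one class.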

5.2 Commenting on under fitting/overfitting/just right model¶

In [26]:
print("Accuracy on training set: {:.3f}".format(lr.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(lr.score(X_test, y_test)))
Accuracy on training set: 0.602
Accuracy on test set: 0.628

From the above accuracy data we can say that the model is underfitting: both training (0.602) and test (0.628) accuracy sit at the majority-class baseline (about 61% of samples are non-potable), so the model has effectively learned to predict the majority class.

Justification¶

The model is unable to capture the relationship between the predictors and the target, and thus gives a poor training score.
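The notebook imports cross_val_score but never calls it; a 5-fold check is a quick way to confirm this diagnosis (a tight spread of mediocre fold scores is consistent with underfitting, while a large train/test gap would suggest overfitting). A sketch on synthetic stand-in data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for (X, y); the real call would be
# cross_val_score(lr, X, y, cv=5) on the water-potability features
rng = np.random.default_rng(7)
X = rng.normal(size=(300, 4))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.7, size=300) > 0).astype(int)

# Five held-out accuracy scores, one per fold
scores = cross_val_score(LogisticRegression(), X, y, cv=5, scoring='accuracy')
print(np.round(scores, 3), round(scores.mean(), 3))
```

For the water-potability model, fold scores clustered around the ~61% majority baseline would confirm underfitting across splits, not just for the single 70:30 split used above.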